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Abstract 

A common approach in computational science is to use a set of of highly precise but expensive 
calculations to parameterize a model that allows less precise, but more rapid calculations on larger 
scale systems. Least-squares fitting on a model that underfits the data is generally used for this 
purpose. For arbitrarily precise data free from statistic noise, e.g. ah initio calculations, we argue 
that it is more appropriate to begin with a ensemble of models that overfit the data. Within 
a Bayesian framework, a most likely model can be defined that incorporates physical knowledge, 
provides error estimates for systems not included in the fit, and reproduces the original data exactly. 
We apply this approach to obtain a cluster expansion model for the CaZri_j;Ti-r03 solid solution. 



A common approach in computational science is the use of a small number of highly 
precise but expensive calculations to generate data to fit the parameters of a less accurate 
but more computationally tractable "effective model" enabling larger-scale simulations. [H |2] 
A typical example is the fit of a simplified energy model to accurate quantum mechanical 
calculations. |3l HI El El [3, El El [THl [H] Although least-squares minimization is traditionally 
used for this purpose, it is not commonly recognized that this approach implicitly and 
incorrectly assumes that the uncertainty lies in the data rather than in the effective model. 

Here we show that the fact that the model is less accurate than the data can be properly 
taken into account within a Bayesian[T2] framework where the "prior" probability distribu- 
tion of the model parameters characterizes the range of physically plausible models. The 
model parameters are obtained by maximizing the "posterior" distribution provided by 
Bayes rule, given the accurate data and the prior. This approach enables a perfect fit to 
the input noiseless data, while avoiding the usual artifacts of overfitting[TO] and enables the 
seamless inclusion of physical knowledge into the fitting procedure via the prior. Although 
Bayesian methods have a long history in the statistical sciences, (including recent interest 
in Bayesian learning techniques [T3| [T^ ) . the unexpectedly well-behaved limit of completely 
noiseless data we report here has, to our knowledge, not been noted, perhaps because existing 
methods have historically been motivated by the need to fit noisy experimental data rather 
than noiseless calculated data. However, the latter setting clearly deserves more attention. 

While our general theoretical approach should have broad applicability in numerous fields 
of computational sciences, in this Letter, we focus on the specific but broadly applicable 
example of the construction of an efficient energy model for a crystalline alloy. [3] This task 
has immediate applications to thermodynamic modeling of alloys and the determination of 
their phase diagrams, a crucial component of alloy design and optimization. 

In this context, the accurate total energy data are provided by ab initio electronic struc- 
ture methods based upon density functional theory (DFT),[Tni [IS] whose accuracy has 
been thoroughly validated in a wide range of solid-state systems. [T7] (Although such ab 
initio calculations may not provide the exact quantum mechanical result, they are pre- 
cise in that they are virtually free of random errors, as numerical noise is well-controlled 
in modern ab initio software.) The effective model is a so-called cluster expansion 
(CE),[31 HElElElElElIinillllllH] that takes the form of a polynomial in occupation 
variables (described in detail below) indicating which atom lies on each lattice site. The 
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unknown parameters of CE to be determined are the coefficients of this polynomial. The 
CE has been previously shown [TH] to be able, in principle, to exactly represent any pos- 
sible configurational-dependence of the energy, provided that all terms the expansion are 
included, which unfortunately amounts to an infinite number of terms. 

Typically, such CE models are created through a least squares fit to a database of ab 
initio structural energies obtained using a CE truncated to a finite number of terms, so 
that the number of input configurations is larger than the number of unknown parameters. 
This leads to a "truncation problem", where the terms to be retained in the model must 
be determined. Approaches for optimizing the truncation in this context have included 
the cross-validation score minimization [Sj [THl [19], sometimes combined with regularization 
techniques [201 |2T] and conventional Bayesian approaches. [22| [23] 

These approaches treat systematic errors (due to model truncation) and statistical errors 
(due to numerical noise in the data) on an equal footing without exploiting the knowledge 
that statistical errors are, in fact, negligible in this context. In the large sample limit, 
truncation selection methods would eventually "discover" that the statistical noise is zero, 
but considerable improvements are possible if this known fact is explicitly taken into account 
from the start. 

In this Letter, we avoid truncation problems by including many more terms in the effective 
model than the number of ab initio calculations performed. Although the fitting problem is 
underdetermined, it can nonetheless be solved by using Bayesian inference with a physically 
based prior probability distribution for the model coefficients. 

A configuration i in a binary alloy is defined by the occupation of each site of a lattice 
by one of two species, indicated by a spin-like variable Sik = ±1. Each configuration i has 
an associated ab initio energy Ei. These energies can be written in terms of effective cluster 
interactions [6l [T8] : 

a 

where = (rifceQ Sik) is the translationally and rotationally averaged multibody spin cor- 
relation for each symmetry- independent cluster a, while Ja is the associated effective inter- 
action parameter to be determined. 

Bayes' theorem|T2] states that given a prior probability distribution P^^\j) of the un- 
known parameter vector J and one energy observation Ei, the posterior probability of J 
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FIG. 1: (Color online) The ellipsoids schematically represent equiprobability surfaces of a many- 
dimensional prior probability distribution for P^{J). The plane represents the constraints on J 
given by the results of an ab initio total energy calculation. The {N — l)-dimensional ellipsoid 
sliced by the plane gives an equiprobability surface for the posterior distribution of P^^\j); the 
point marked J^^\ represents the most likely solution for J. 

given Ei is 

P{J\E,) <x PiE,\J)P^'\j). 

Since Ei is precisely known, the conditional probability reduces to a delta function 

P{E,\J) = 6{^,J-E,), (1) 

where C,i denotes the row vector of all values of for a fixed i. As a result, P{J\Ei) is 
trivially proportional to S{^iJ — E.i)P^'^\j). By induction, the posterior probability of J 
based on all the energy information Ei, i = 1, ... ,n is 

n 

P^-\j)o,\{5{i,J-E,)P^'\j). 

i=l 

The difference between prior and posterior probabilities is represented geometrically in Fig. [1} 
Each new data point selects a cross-section of the prior distribution corresponding to sets of 
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parameter values that agree perfectly with this data point. Within the intersection of these 
cross-sections, each point has a different posterior probability that is dictated by the prior. 
The most likely model parameters J*^"^ can be determined by maximizing P(")(J) (see Key 
methodological details, at end). This approach selects a unique solution from an otherwise 
underdetermined system of equations, based on the "physical" information provided by the 
prior. 

The width of the posterior provides a measure of the uncertainty remaining in the fitted 
parameter after the data has been incorporated, which can be used, for instance, to access 
the accuracy of predicted energies for any structure not included in the fit. For a Gaussian 
prior, the posterior is Gaussian as well and the most likely parameter values are also the 
expected parameter values, which implies, given the linearity of the cluster expansion, that 
the predicted energies from the CE model will also be expected values. 




FIG. 2: (Color online) (a) Representative CZT structure; (b) Common 80-atom supercell for CZT 
energy calculations highlighted in bold. 

The above procedure was applied to modeling total energies in the CaZri_^Tia,.03 (CZT) 
system, a perovskite solid solution with tilted oxygen octahedra. [21] The structures stud- 
ied were constrained to a common 80-atom supercell shown in Fig. [2| which has 16 per- 
ovskite "B" sites that contain either Zr or Ti. Local density functional theory calcula- 
tions were performed on specific CZT configurations using the VASP[25J code and ultrasoft 
pseudopotentials, [26] , with semicore p electrons treated as valence electrons for Ca, Zr, and 
Ti. A 375 eV plane wave energy cutoff and a 1500 eV cutoff for augmentation charges were 
used. The fc-point mesh was equivalent to an 8 x 8 x 8 Monkhorst-Pack grid for a primitive 
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perovskite cell. The number of symmetrically distinct possible arrangements of Zr and Ti 
and the number of terms in the full cluster expansion are both 2386; all terms are retained 
in the fitting. 

Based on simple physical considerations, the following Gaussian prior was selected 
P(°)(J) = n(v^^a)-'exp ((-J„)V(2^^)) , 

a 

with Wa = Ab^" Y{{i j]caij ij /frain)^'^ ■ Clustcrs with morc sites are expected to have 
smaller coefficients [i.e. h < 1), and clusters with pairs {«, j} of atoms at larger separations 
Tij are expected to have smaller coefficients. The exponent of —2 is motivated by the 
observation that Zr and Ti have the same charge, so interactions cannot be mediated by 
differences in monopole coupling, leaving only dipolar leading terms. We set h = 0.17, based 
on the ratio of the quadratic to linear terms in a Taylor expansion to the energies of three 
trial structures (pure CaZrOa, pure CaTiOs, and rocksalt-ordered CaZri/2Tii/203), and A = 
1.58 eV per atom, based on scaling the error estimates for all 30 structures included in the 
final fit (see Key methodological details). 

Our Bayesian approach is embedded in an iterative scheme where, at each step, a most 
likely solution J is found and error estimates for the other configurations not included in 
the fit are calculated. The configuration with the highest estimated error is then added to 
the fit, after its ab initio total energy is calculated (see Key methodological details). This 
procedure is repeated until a sufficient predictive accuracy has been reached, as illustrated 
in Fig. [3j A monotonic, rapid improvement of the fit is found as the number of structures 
included in the fit increases. For calculation efficiency purposes, one may truncate the terms 
to be retained in the solution to those whose exceeds some small threshold value, with 
little effect on the results (as suggested in |20] in a related context); in our methodology, 
these terms can be retained in estimating errors on predicted energies. 

For thermodynamic applications, such as phase diagram calculations, [10] it is crucial that 
the lowest lying energies have the proper order and energy differences. The least-squares 
fitting procedure does not guarantee this, even if these states were included in the fitting 
procedure. By reproducing the energies of all states included in the fit exactly, our procedure 
avoids this problem. 

While the need to specify a prior in Bayesian methods is often criticized, it should be 
realized that a conventional least-squares fit is not free of a priori assumptions either. In 
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FIG. 3: Convergence of the mean and maximum predicted errors on energies of configurations not 
included in the fit, as a function of the number of configurations included in fit. 

the cluster expansion example, a conventional fit with a user-specified truncation distance 
amounts to a prior which is uninformative (flat) for the included interactions coefficients 
but entirely concentrated at zero for the excluded coefficients. This incorporates physical 
knowledge into the problem, but with a complete certainty that far exceeds what a researcher 
could plausibly know. A smoother prior which gradually concentrates the probability to- 
wards zero coefficient values as the range of interaction increases appears a more appropriate 
description of a priori information. 

If the data does contain some error (for instance, significant numerical noise), then the 
delta function in ([T]) must be replaced by a smooth density and a conventional Bayesian 
methodology would result (as in [22l 123]). If that density is Gaussian, then one recovers a 
so-called ridge regression or Tikhonov regularization[27], a penalized least-squares estimator, 
which has been previously used in the context of CE construction. [201 EI] 

One possible way to select a prior is to estimate a range of likely coefficient values on 
the basis of analogous systems. Good theoretical knowledge of the physics governing the 
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interactions is also useful. [23] Another possibility is to use the data itself to optimize the 
choice of the prior. Although this is not appropriate in a strict Bayesian context, it is 
conceivable that some type of cross-validation approach could be a valid prior-selection 
method in a hybrid Bayesian/frequentist context. This scheme is the noiseless limit of 
an existing variational cross-validation method [20]. Our preliminary results in this vein 
indicate that the method does have the ability to rule out clearly unphysical priors in favor 
of more physically plausible priors. Formally quantifying this method's discriminating power 
represents a fruitful avenue of future investigation. 

The construction of multibody interatomic force fields based upon quantum mechanical 
data [281 [29] represents an important possible application of the methodology proposed here. 
It differs from the alloy problem in that the energy is a function of a continuous set of 
distance variables, rather than a discrete set of spacings on a lattice. While force fields are 
often selected to be nonlinear in the unknown parameters, energy models that are linear in 
the parameters (such as splines) are probably preferable in our context, as they make the 
interaction optimization problem numerically stable and efficient. The lack of physical basis 
for spline functions is alleviated by the possibility of including many more parameters than 
data points and the ability to favor physically plausible interaction shapes via the prior. 
Such an approach could lead to successful force fields models, with defined uncertainties and 
with broad applications in computational physics. 

Key methodological details: With a zero-mean Gaussian prior for the vector J con- 
taining all interactions Jq,, the posterior maximum is given by J{") = W^'^i^W^^y^E, where 
W denotes the variance-covariance matrix of the prior while ^ denotes the matrix with el- 
ements C,ia and E denotes the vector of all n known structural energies Ei. An equivalent 
approach is to perform the changes of variables ^' = ^W^^'^ and J'^"-* = (where 
we use the symmetric square root matrix), and find the minimum Euclidean norm solution 
J'(") to ^'J'(") = E using, for instance, standard singular value decomposition techniques. 

The expected energy of any additional configuration i = n + 1 (with correlations equal to 
the row vector ^(n+i)) is given by J*-"-*. The root mean square error in this energy is sim- 
ply given by |Pr'i"^'C'ri+il) where Pr'_["^ is a projector in the prime space that simultaneously 
projects C,'n+i orthogonal to all ^'j<„. 

If all terms Wa in a Gaussian prior are scaled by the same value x, then: (1) J^"^^ remains 
unchanged; (2) all predicted errors are scaled by x. In an analogous way to the leave-one-out 
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cross validation method[Tn], one can take the set of structures used to obtain J*^"^ , omit 
each single structure i one-at-a-time, and calculate the estimated and actual errors for Ei 
based on fitting the model to all the other structures. A self-consistent value for x is then 
obtained by scaling the mean square estimated errors to equal the actual errors. 
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